Instructions

Below you will find several empty R code scripts and answer prompts. Your task is to fill in the required code snippets and answer the corresponding questions.

Cereal Data

Today, we start by looking at a collection of breakfast cereals:

With variables:

Produce a histogram of the sugar variable.
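
One way to fill in this chunk is sketched below. The data frame name `cereal` and its values are assumptions made only so the snippet runs on its own; in the lab you would use the cereal data loaded above.

```r
library(ggplot2)

# Toy stand-in for the cereal data (hypothetical values, not the real dataset).
cereal <- data.frame(sugar = c(0, 3, 3, 6, 7, 9, 10, 12, 13, 15))

# Histogram of sugar; binwidth = 3 is an arbitrary choice for this toy data.
p <- ggplot(cereal, aes(x = sugar)) +
  geom_histogram(binwidth = 3)
print(p)
```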

Now, compute the standard deviation of the variable sugar:

## [1] 4.378656

What are the units of this measurement?

Answer: grams

Now, compute the deciles of the variable score:

##   0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
## 18.0 28.0 31.0 34.5 37.0 40.0 42.0 48.0 53.0 58.0 84.0

What is the value of the 30th percentile? Describe what this means in words:

Answer: The value is 34.5. This means that 30% of the cereals have a healthiness score below 34.5, and 70% of the cereals have a healthiness score above 34.5.
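
To see what a percentile says about the data, the sketch below computes deciles with `quantile` and then checks the share of values below the 30% cutoff. The vector `score` holds made-up numbers for illustration only; it is not the lab's cereal data.

```r
# Hypothetical healthiness scores (illustrative values only).
score <- c(18, 22, 27, 29, 31, 33, 35, 36, 38, 40,
           41, 42, 45, 48, 50, 53, 55, 58, 63, 84)

# Deciles: quantiles at 0%, 10%, ..., 100%.
deciles <- quantile(score, probs = seq(0, 1, by = 0.1))
print(deciles)

# Share of scores strictly below the 30th percentile; should be close to 0.3.
p30 <- deciles[["30%"]]
share_below <- mean(score < p30)
print(share_below)
```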

Produce a boxplot of score and brand.
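
A possible shape for this chunk, assuming the data frame is called `cereal` and using placeholder brand names and scores rather than the real data:

```r
library(ggplot2)

# Toy stand-in for the cereal data (hypothetical brands and scores).
cereal <- data.frame(
  brand = rep(c("A", "B", "C"), each = 4),
  score = c(30, 35, 40, 45, 50, 55, 60, 65, 25, 28, 33, 38)
)

# One box per brand; the categorical variable goes on the x-axis.
p <- ggplot(cereal, aes(x = brand, y = score)) +
  geom_boxplot()
print(p)
```

The same pattern covers the shelf boxplots by swapping the `x` and `y` variables.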

Which brand seems to have the healthiest cereals?

Answer: Nabisco

Produce a boxplot of score and shelf.

Produce a boxplot of sugar and shelf.

If I want a healthy but reasonably sweet cereal, which shelf would be the best to look on?

Answer: The top shelf.

Tea Reviews

Next, we will take another look at a dataset of tea reviews that I used in a previous lecture:

With variables:

- name: the full name of the tea
- type: the type of tea. One of:
    - black
    - chai
    - decaf
    - flavors
    - green
    - herbal
    - masters
    - matcha
    - oolong
    - pu_erh
    - rooibos
    - white
- score: user rated score; from 0 to 100
- price: estimated price of one cup of tea
- num_reviews: total number of online reviews

Draw a scatterplot with num_reviews (x-axis) against score (y-axis) and add a regression line (recall: geom_smooth(method="lm")).
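
A minimal sketch of such a plot follows. The data frame name `tea` and its values are assumptions for illustration; the fitted slope is printed as one way to check the direction of the trend numerically.

```r
library(ggplot2)

# Toy stand-in for the tea data (hypothetical values).
tea <- data.frame(
  num_reviews = c(5, 12, 20, 35, 50, 80, 120, 200),
  score = c(82, 85, 84, 88, 87, 90, 91, 93)
)

# Scatterplot with a least-squares regression line on top.
p <- ggplot(tea, aes(x = num_reviews, y = score)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)
print(p)

# The sign of the fitted slope gives the direction of the trend.
slope <- coef(lm(score ~ num_reviews, data = tea))[["num_reviews"]]
print(slope)
```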

Does the score tend to increase, decrease, or remain the same as the number of reviews increases?

Answer: The score tends to increase.

Calculate the ventiles of the variable price.

##     0%     5%    10%    15%    20%    25%    30%    35%    40%    45% 
##   8.00  10.00  10.00  10.00  10.00  10.00  12.00  12.00  12.00  12.00 
##    50%    55%    60%    65%    70%    75%    80%    85%    90%    95% 
##  13.00  15.00  15.00  17.00  19.00  20.00  30.00  35.35  49.30  86.75 
##   100% 
## 196.00
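
Ventiles are the quantiles taken in steps of 5%, so they can be computed with `quantile` and a suitable `probs` sequence. The sketch below uses a made-up `price` vector, not the lab's data.

```r
# Hypothetical tea prices (illustrative values only).
price <- c(8, 10, 10, 12, 12, 13, 15, 17, 19, 20, 30, 35, 49, 87, 196)

# Ventiles split the data into 20 equal-probability groups,
# so the cut points sit at 0%, 5%, 10%, ..., 100%.
ventiles <- quantile(price, probs = seq(0, 1, by = 0.05))
print(ventiles)
```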

What is the 80th percentile? Describe it in words and include the units of the problem in your answer.

Answer: The 80th percentile is 30.00, with price measured in dollars. This means that 80% of the teas cost less than $30, and 20% of the teas cost more than $30.

Plot the number of reviews (x-axis) against the score variable. Color the points according to price binned into 5 buckets.
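
One possible way to bin the price is `cut_number` from ggplot2, which aims for groups of roughly equal size; this is an assumption, and the course may provide its own binning helper. The data frame `tea`, its values, and the column name `price_bin` are hypothetical.

```r
library(ggplot2)

# Toy stand-in for the tea data (hypothetical values).
tea <- data.frame(
  num_reviews = c(150, 120, 90, 80, 60, 45, 30, 20, 10, 5),
  score = c(90, 88, 91, 86, 89, 92, 87, 93, 94, 95),
  price = c(8, 10, 12, 13, 15, 19, 30, 35, 49, 196)
)

# Split price into 5 groups with roughly equal numbers of teas (quintiles).
tea$price_bin <- cut_number(tea$price, 5)

# Color each point by its price bucket.
p <- ggplot(tea, aes(x = num_reviews, y = score)) +
  geom_point(aes(color = price_bin))
print(p)
```

Note that `cut_number` can fail when many tied values make the bucket boundaries coincide, so it is worth checking the resulting levels.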

What tends to be true about the number of reviews for the most expensive 20% of teas?

Answer: There are not many reviews for the most expensive teas.

Create a dataset named white that consists of only white teas.

Calculate the standard deviation of the price for white teas and the standard deviation of the price for all of the teas.

## [1] 13.59444
## [1] 30.42485
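
The two steps above (subsetting, then comparing spreads) can be sketched as follows. The snippet assumes dplyr's `filter` and a data frame named `tea`; the values are invented for illustration.

```r
library(dplyr)

# Toy stand-in for the tea data (hypothetical values).
tea <- data.frame(
  type  = c("white", "white", "white", "black", "black", "masters"),
  price = c(10, 12, 15, 8, 20, 196)
)

# Keep only the rows where the tea type is "white".
white <- filter(tea, type == "white")

# Compare the spread of prices within white teas to the spread overall.
sd_white <- sd(white$price)
sd_all <- sd(tea$price)
print(c(white = sd_white, all = sd_all))
```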

Is the variation of the white tea prices smaller, larger, or about the same as the entire dataset?

Answer: The variation of the white tea prices is smaller than the variation of the prices in the entire tea dataset.

Summarize the dataset by the type of tea and save the results as a variable named tea_type.
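
A sketch of a grouped summary using dplyr's `group_by` and `summarize` is given below. The course may use a different helper, and the output column names `price_mean`, `score_mean`, and `n` are assumptions, as are the toy values.

```r
library(dplyr)

# Toy stand-in for the tea data (hypothetical values).
tea <- data.frame(
  type  = c("white", "white", "black", "black", "black", "masters"),
  score = c(90, 94, 88, 92, 90, 96),
  price = c(10, 14, 8, 12, 10, 196)
)

# One row per tea type with the average price, average score, and count.
tea_type <- tea %>%
  group_by(type) %>%
  summarize(
    price_mean = mean(price),
    score_mean = mean(score),
    n = n()
  )
print(tea_type)
```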

Plot the average price (x-axis) against the average score (y-axis) of each type of tea. Make the size of the points proportional to the number of teas in each category and label the points with geom_text_repel and the tea type.
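
The plot described here might look like the sketch below, which assumes the ggrepel package is installed and that the summary table has columns named `price_mean`, `score_mean`, and `n` (hypothetical names and values).

```r
library(ggplot2)
library(ggrepel)

# Toy summary table (hypothetical values); column names are assumptions.
tea_type <- data.frame(
  type = c("black", "green", "masters", "white"),
  price_mean = c(12, 14, 90, 16),
  score_mean = c(90, 91, 95, 92),
  n = c(40, 30, 12, 10)
)

# Point size tracks the number of teas; labels are nudged to avoid overlap.
p <- ggplot(tea_type, aes(x = price_mean, y = score_mean)) +
  geom_point(aes(size = n)) +
  geom_text_repel(aes(label = type))
print(p)
```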

Describe an interesting pattern or set of outliers that you found in the previous plot. This does not need to take more than 1-2 sentences.

Answer: One interesting outlier is the masters tea. This group contains a medium number of teas, but it is by far the most expensive type, and it also has the highest number of reviews. Matcha is another interesting outlier: its price is also higher than that of most other types, but its reviews are lower, which suggests that it may be less popular than the other teas, including masters.